[SOUND] Hello, everyone. Welcome back to the Heterogeneous Parallel Programming class. This is Lecture 1.3, Portability and Scalability in Heterogeneous Parallel Computing.

The objective of this lecture is to help you understand the importance and nature of scalability and portability in parallel programming.

This slide shows data published by IBM in 2010. The graph shows that hardware costs and software costs have both been growing exponentially over the years. This is a log plot, so a straight line in this plot is an exponential curve. Software cost is measured in lines of code per chip, and it has been doubling every ten months. Hardware cost is measured in gates per chip, and it has been doubling every 18 months. So software cost has been growing much faster than hardware cost. Even though software cost started out lower than hardware cost, after 2010 it has essentially exceeded hardware cost, and it has been growing faster. So software cost is going to be much, much higher than hardware cost in the years to come.
So in the future, all systems need to be able to minimize software development cost, or redevelopment cost, and this leads to the consideration of scalability and portability.

The first aspect of software cost control is scalability. If we develop an application to run well on Core A, what we would like to do is make sure that the same application, without significant redevelopment, can run efficiently on the next version of Core A, let's say Core A 2.0. This allows us to use the same application when a new generation of hardware is introduced. Whenever scalability holds, the developer does not need to revise the software in order to run well on the new generation of hardware, therefore reducing the redevelopment cost.

There is another dimension of scalability. Whenever an application runs well on one Core A, we would also like it to run well on multiples of these Core As, that is, on more of the same cores. This allows us to add performance by adding more hardware to the system. In many situations, vendors would like to introduce multiple versions of hardware, and each version will have increasingly more hardware available to the users.
So if we can develop a piece of software that is scalable, that runs well on more of the same cores, this gives the vendor scalability for their hardware, so that when they introduce more hardware, the users can actually observe increased performance from the application.

In the future, we expect that there will be several generations of hardware where performance will be increased by adjusting many parameters. For example: the number of compute units or cores, the number of threads, increased vector length, increased pipeline depth, increased DRAM burst size, an increased number of DRAM channels, and changes in data movement latency. All of these hardware parameters can significantly affect the performance of an application, and oftentimes applications need to be tuned to particular settings of these parameters. The programming style that we use in this course addresses these needs by supporting fine-grained problem decomposition and dynamic thread scheduling, so that an application you write in this programming style will be able to automatically adjust to a fairly wide range of parameter values that the hardware vendors may change.
So this allows your application to run well on one generation of hardware and continue to run well on a future generation of hardware. And also, if your application runs well on one of the cores, you can expect the application to also run well on more of the same cores.

The second dimension of software cost control is portability. Portability means that if we develop an application to run well on Core A, we would also like it to be able to run well on different types of cores, in this case Core B and Core C in the picture. Oftentimes, the application developed for Core A may initially be running on one vendor's product. But if the application is portable, then the users can expect the same application to also run well on different hardware types, oftentimes from different hardware vendors. This can also decrease the software cost, because the developer will not need to redevelop or revise their application so that it can run well on other vendors' systems.

And a lot of times we will see different design styles from different vendors. For GPUs in particular, we often see significantly different design styles, as illustrated in this picture.
In terms of the particular kinds of differences, for CPU cores we would see different instruction set architectures, such as x86 versus ARM versus other instruction set architectures. These instruction set architectures oftentimes require different compiler code generation and so on, and that oftentimes will affect the portability of your application. We also have different design styles even based on the same instruction set architecture: we could have latency-oriented CPU designs versus throughput-oriented GPU designs. So when we develop a piece of code, can we expect it to run well on a latency-oriented CPU if it runs well on a throughput-oriented GPU? The third dimension of portability often comes from different styles of parallelism in the processor core. There is a design style called VLIW, there is a design style called SIMD, and there is a design style called multithreading. We are going to touch quite a bit on this level of hardware differences. And then it also comes down to how we organize DRAM in the system, whether it is a shared memory model or a distributed memory model.
All of these dimensions affect the portability of your application. So as we work toward the end of the course, we will introduce emerging standards such as OpenCL and the Heterogeneous System Architecture that will help to address the portability of your applications.

At this point, we have completed all the high-level introduction and context information lectures. Starting from the next lecture, we are going to be introducing the CUDA programming interface and begin to help you develop your lab application assignments. For those of you who would like to learn more about the topic and the context information, I would like to encourage you to read Chapter 1 of the textbook. Thank you.